Version: 25.08

Best Practices

General Recommendation for Minimum Replicas

To enhance reliability and resiliency, particularly in environments prone to load spikes, it is recommended to increase minReplicas to reflect the baseline replica counts observed during normal operation. This ensures that the components can handle sudden surges in requests, even after periods of inactivity. A downscaled environment may struggle to cope with unexpected traffic increases, potentially impacting performance.

A practical approach is to note the typical number of replicas during regular operation. The default minReplicas value may be too low, so adjust it to match the approximate average replica count seen under normal conditions. Apply this to the following HPAs:

  • dlp-coordinator
  • content-inspection-scanner
  • dlp-tika
  • dlp-ocr

By setting minReplicas according to this observed baseline, the system can better handle varying loads and maintain stability.
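
As a minimal sketch, the current replica counts and HPA settings can be reviewed and adjusted with kubectl. The namespace placeholder <namespace> and the example baseline of 4 replicas are assumptions rather than shipped defaults; substitute the namespace your stack is deployed in and the baseline you actually observed.

    # Review current and minimum replica counts for the content inspection HPAs
    kubectl get hpa dlp-coordinator content-inspection-scanner dlp-tika dlp-ocr -n <namespace>

    # Raise minReplicas to the observed baseline (4 is only an example value)
    kubectl patch hpa dlp-coordinator -n <namespace> \
      --type merge -p '{"spec": {"minReplicas": 4}}'

If the HPAs are managed through Helm values or another deployment mechanism, apply the equivalent change there so it persists across upgrades.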

Troubleshooting

When observed success rates fall below the recommended thresholds, investigate the key areas below to identify and resolve potential issues. The following sections outline critical aspects to examine: service stability, component failures, and scalability challenges. Addressing these factors helps diagnose and remediate operational issues in the on-premises content inspection stack and restore optimal performance.

Service Stability

Ensure that all deployed services are operational and not experiencing frequent restarts. While occasional restarts are normal, multiple restarts due to crashes or out-of-memory (OOM) issues require immediate attention. If such problems are detected, it's recommended to contact Cyberhaven with the logs from the crashed container, including the exit code and any relevant context.
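
As a sketch, assuming kubectl access to the cluster and using <namespace> as a placeholder, restart counts and OOM kills can be checked as follows:

    # List pods and their restart counts
    kubectl get pods -n <namespace>

    # Inspect a frequently restarting pod; look for OOMKilled or a non-zero exit code under "Last State"
    kubectl describe pod <pod-name> -n <namespace>

    # Capture the logs of the previous (crashed) container instance to share with Cyberhaven
    kubectl logs <pod-name> -n <namespace> --previous > crashed-container.log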

Identify Failing Components

Failures can often be isolated to a specific component or path. The dlp_coordinator_fatal_errors_per_component metric can help identify which components are experiencing issues. If the number of failures significantly exceeds what was observed in previous time windows when the system was operating normally, check the logs of the affected containers. Failures may be caused by missing permissions or by communication issues with the Cyberhaven SaaS platform, both of which should be evident in the logs. These problems can sometimes arise after operational changes, such as modifications to the service accounts used to authenticate to SaaS services. If the issue cannot be resolved internally, contact Cyberhaven with the logs and relevant metrics for further diagnosis.
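
As a sketch, the logs of an affected component can be pulled with kubectl. Here <namespace> is a placeholder, the deployment names are assumed to match the HPA names listed above, and the grep keywords are only examples of what permission or connectivity failures might look like:

    # Tail recent logs from the coordinator
    kubectl logs deployment/dlp-coordinator -n <namespace> --since=1h --tail=500

    # Scan scanner logs for permission or SaaS connectivity problems
    kubectl logs deployment/content-inspection-scanner -n <namespace> --since=1h | grep -iE "permission|denied|unauthorized|timeout"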

Scalability Issues

A drop in success rates may indicate scalability problems if the components are not scaling up as expected. It is recommended to monitor the number of replicas for each horizontally scalable component (e.g., content-inspection-scanner, dlp-coordinator, dlp-tika) to check for any decrease in replica counts or changes in scaling behavior. Additionally, investigating Horizontal Pod Autoscaler (HPA) failures could reveal resource constraints, such as underprovisioning (i.e., inability to scale due to lack of resources in the node pool) or missing scaling metrics. As an immediate remedy, consider manually scaling up the component by increasing the minimum number of replicas beyond the levels observed during periods of normal operation, and contact Cyberhaven for further support.
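
As a sketch, HPA status, scaling events, and node pool capacity issues can be inspected with kubectl, and minReplicas can be raised as an immediate remedy; <namespace> and the value 6 are placeholders, not recommended settings:

    # Check current, desired, and minimum/maximum replicas for each HPA
    kubectl get hpa -n <namespace>

    # Look for missing metrics or failed scaling events on a specific HPA
    kubectl describe hpa content-inspection-scanner -n <namespace>

    # Check for pods that cannot be scheduled due to insufficient node pool capacity
    kubectl get events -n <namespace> --field-selector reason=FailedScheduling

    # Immediate remedy: raise minReplicas above the normal-operation baseline (6 is only an example)
    kubectl patch hpa content-inspection-scanner -n <namespace> \
      --type merge -p '{"spec": {"minReplicas": 6}}'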

Endpoint Agent Impact from Delayed and Failed Content Inspection

To be added by Cyberhaven

On-prem Content Inspection Logs Troubleshooting

To be added by Cyberhaven

Endpoint Agent-Side Content Inspection Metrics

The Endpoint Agent-side content inspection metrics are available as a dashboard in Visual Analytics. In that dashboard, the error categories are defined as follows:

  • http_error: Any HTTP error returned by the DLP proxy or by any firewall in between. If these errors are very high or a spike is seen, check DLP pod health.
  • connection_error: DNS errors, socket errors, and similar issues, usually seen when the endpoint loses connectivity or switches between networks. These should resolve once connectivity is restored.
  • app_error: Something went wrong on the endpoint (data encoding issues, an invalid response from the DLP service, etc.). If these errors are very high or a spike is seen, provide Cyberhaven with a diag bundle from the affected endpoints for investigation.
  • timeout: The endpoint timed out while waiting for results from the backend. This is the Agent's own timeout and may differ from any HTTP timeout configured on the backend or from an intermediate gateway closing the connection. If these errors are very high or a spike is seen, check DLP pod health.

Endpoint Agent Logs